HydeNetThe analyses performed in this example are intended to illustrate the process of estimating survival times in hybrid decision networks using the HydeNet package. It also gives a brief explanation of how the predicted survival times are calculated. The analyses are not particularly practical and should not be taken seriously.
Beginning in HydeNet 0.10.4, you may include survival analysis nodes within a network. These nodes have the same appearance and behavior as other nodes, and only require that you use a parametric survival model with the function survival::survreg. The only significant difference in behavior is that you must explicitly define the nodeFormula argument in setNode.
To explore the development of survival models in HydeNet, we may use the veteran dataset from the survival package. The veteran dataset comes from a randomized trial of lung patients for two types of treatment. A sample of the data, after some formatting, are presented in Table 1.
library(survival)
library(dplyr)
data(veteran)
veteran %<>%
mutate(trt = factor(trt,
levels = 1:2,
labels = c("standard", "test")),
prior = factor(prior,
c(0, 10),
c("no", "yes")))
head(veteran) %>%
dust(caption = "Sample of Veterans Data") %>%
hyde_medley()
| trt | celltype | time | status | karno | diagtime | age | prior |
|---|---|---|---|---|---|---|---|
| standard | squamous | 72 | 1 | 60 | 7 | 69 | no |
| standard | squamous | 411 | 1 | 70 | 5 | 64 | yes |
| standard | squamous | 228 | 1 | 60 | 3 | 38 | no |
| standard | squamous | 126 | 1 | 60 | 9 | 63 | yes |
| standard | squamous | 118 | 1 | 70 | 11 | 65 | yes |
| standard | squamous | 10 | 1 | 20 | 5 | 49 | no |
A quick univariable analysis, presented in Table 2, illustrates that is apparent balance between the standard and test groups. The balance by celltype is a little weak, but probably wasn’t considered in the randomization anyway.
Table 2: Evaluation of the Balance of Randomization
| standard | test | ||||||
| Factor | Total | N | Statistics | N | Statistics | Test Statistic | p-value |
| time a | 137 | 69 | 115.14 (90.15, 140.53) | 68 | 128.21 (87.78, 177.93) | 2564 | 0.35W |
| celltype b | 137 | 7.04 | 0.071C | ||||
| squamous | 35 | 15 | 42.86 | 20 | 57.14 | ||
| smallcell | 48 | 30 | 62.5 | 18 | 37.5 | ||
| adeno | 27 | 9 | 33.33 | 18 | 66.67 | ||
| large | 27 | 15 | 55.56 | 12 | 44.44 | ||
| karno a | 137 | 69 | 59.2 (54.64, 63.41) | 68 | 57.93 (53, 62.6) | 2419.5 | 0.75W |
| diagtime a | 137 | 69 | 8.65 (6.84, 10.93) | 68 | 8.9 (6.4, 12.21) | 2564.5 | 0.35W |
| age a | 137 | 69 | 57.51 (54.93, 60.1) | 68 | 59.12 (56.51, 61.59) | 2175 | 0.46W |
| prior b | 137 | 0.02 | 0.89C | ||||
| no | 97 | 48 | 49.48 | 49 | 50.52 | ||
| yes | 40 | 21 | 52.5 | 19 | 47.5 | ||
a Mean (95% Bootstrap CI); b Percentage
C: Pearson’s Chi-squared test
W: Wilcoxon rank sum test with continuity correction
The simplest representation of these data in HydeNet is developed as a single model with node time where all of the independent variables are parents.
library(HydeNet)
library(magrittr)
library(ggplot2)
Lung_simple <-
HydeNetwork( ~ time | trt * celltype * karno * diagtime * age * prior,
data = veteran)
plot(Lung_simple)
Looking at the parameter specifications for the time node shows us that HydeNet assumes time is to be calculated from a linear regression. We need to explicitly tell HydeNet to use a parametric survival regression.
print(Lung_simple, time)
## A Probabilistic Graphical Network
## Has data attached: Yes
##
## time | trt * celltype * karno * diagtime * age * prior
## dnorm(mu = fromData, tau = fromData)
## lm: time ~ trt + celltype + karno + diagtime + age + prior
Lung_simple %<>%
setNode(time,
nodeType = "dnorm",
nodeFitter = "survreg",
nodeFormula = Surv(time, status) ~ trt + celltype + karno +
diagtime + age + prior,
fitterArgs = list(dist = "weibull", scale = 1),
mu = fromData(),
tau = fromData())
Notice that we declared the nodeType for the time node to be "dnorm", just as we would have for linear regression. We are able to do this because the outcome we are predicting is time, and the prediction has an approximately normal distribution. The standard error of the predictions is calculated, in matrix notation, as
\[ X \cdot VV \cdot X' \]
Where \(X\) is the matrix of independent variables, \(X'\) is it’s transpose, and \(VV\) is the variance covariance matrix. These matrices are determined from the model built by survreg and written into the JAGS model. As we’re about to see, this adds quite a bit of additional text to our JAGS code. But for the most part, this additional code is transparent to our analysis. The additions appear in the code as vvmat.time, xmat.time, and xmatprime.time.
writeNetworkModel(Lung_simple, pretty = TRUE)
## model{
## vv.time[1,1] <- 0.49566; vv.time[1,2] <- -0.00209; vv.time[1,3] <- -0.04103; vv.time[1,4] <- -0.0343; vv.time[1,5] <- -0.0311; vv.time[1,6] <- -0.00232; vv.time[1,7] <- -0.00113; vv.time[1,8] <- -0.00536; vv.time[1,9] <- -0.00473; vv.time[2,1] <- -0.00209; vv.time[2,2] <- 0.03946; vv.time[2,3] <- 0.01686; vv.time[2,4] <- -0.0022; vv.time[2,5] <- 0.01126; vv.time[2,6] <- -3e-05; vv.time[2,7] <- -0.00015; vv.time[2,8] <- -0.00036; vv.time[2,9] <- -0.0036; vv.time[3,1] <- -0.04103; vv.time[3,2] <- 0.01686; vv.time[3,3] <- 0.0687; vv.time[3,4] <- 0.03624; vv.time[3,5] <- 0.0374; vv.time[3,6] <- 0.00022; vv.time[3,7] <- -0.00043; vv.time[3,8] <- -0.00031; vv.time[3,9] <- 0.00985; vv.time[4,1] <- -0.0343; vv.time[4,2] <- -0.0022; vv.time[4,3] <- 0.03624; vv.time[4,4] <- 0.07608; vv.time[4,5] <- 0.03194; vv.time[4,6] <- 0.00015; vv.time[4,7] <- -1e-05; vv.time[4,8] <- -0.00019; vv.time[4,9] <- 0.01113; vv.time[5,1] <- -0.0311; vv.time[5,2] <- 0.01126; vv.time[5,3] <- 0.0374; vv.time[5,4] <- 0.03194; vv.time[5,5] <- 0.07432; vv.time[5,6] <- -7e-05; vv.time[5,7] <- -0.00014; vv.time[5,8] <- -9e-05; vv.time[5,9] <- 0.00282; vv.time[6,1] <- -0.00232; vv.time[6,2] <- -3e-05; vv.time[6,3] <- 0.00022; vv.time[6,4] <- 0.00015; vv.time[6,5] <- -7e-05; vv.time[6,6] <- 3e-05; vv.time[6,7] <- 1e-05; vv.time[6,8] <- 1e-05; vv.time[6,9] <- -0.00016; vv.time[7,1] <- -0.00113; vv.time[7,2] <- -0.00015; vv.time[7,3] <- -0.00043; vv.time[7,4] <- -1e-05; vv.time[7,5] <- -0.00014; vv.time[7,6] <- 1e-05; vv.time[7,7] <- 8e-05; vv.time[7,8] <- 1e-05; vv.time[7,9] <- -0.00089; vv.time[8,1] <- -0.00536; vv.time[8,2] <- -0.00036; vv.time[8,3] <- -0.00031; vv.time[8,4] <- -0.00019; vv.time[8,5] <- -9e-05; vv.time[8,6] <- 1e-05; vv.time[8,7] <- 1e-05; vv.time[8,8] <- 8e-05; vv.time[8,9] <- 4e-05; vv.time[9,1] <- -0.00473; vv.time[9,2] <- -0.0036; vv.time[9,3] <- 0.00985; vv.time[9,4] <- 0.01113; vv.time[9,5] <- 0.00282; vv.time[9,6] <- -0.00016; vv.time[9,7] <- -0.00089; vv.time[9,8] <- 4e-05; vv.time[9,9] <- 0.05147
## xmat.time[1,1] <- 1; xmat.time[1,2] <- trt; xmat.time[1,3] <- (celltype == 2); xmat.time[1,4] <- (celltype == 3); xmat.time[1,5] <- (celltype == 4); xmat.time[1,6] <- karno; xmat.time[1,7] <- diagtime; xmat.time[1,8] <- age; xmat.time[1,9] <- prior
## xmatprime.time[1,1] <- 1; xmatprime.time[2,1] <- trt; xmatprime.time[3,1] <- (celltype == 2); xmatprime.time[4,1] <- (celltype == 3); xmatprime.time[5,1] <- (celltype == 4); xmatprime.time[6,1] <- karno; xmatprime.time[7,1] <- diagtime; xmatprime.time[8,1] <- age; xmatprime.time[9,1] <- prior
##
## time ~ dnorm(exp(0.00611 * age + -1.11312 * (celltype == 3) + -0.37722 * (celltype == 4) + -0.82024 * (celltype == 2) + -3e-04 * diagtime + 0.03062 * karno + -0.04948 * (prior == 2) + -0.21957 * (trt == 2) + 3.18861),
## 1 / (xmat.time[,] %*% vv.time[,] %*% xmatprime.time[,]))
## pi.trt[1] <- 0.50365; pi.trt[2] <- 0.49635
## trt ~ dcat(pi.trt)
## pi.celltype[1] <- 0.25547; pi.celltype[2] <- 0.35036; pi.celltype[3] <- 0.19708; pi.celltype[4] <- 0.19708
## celltype ~ dcat(pi.celltype)
## karno ~ dnorm(58.56934, 0.0499)
## diagtime ~ dnorm(8.77372, 0.09423)
## age ~ dnorm(58.30657, 0.09486)
## pi.prior[1] <- 0.70803; pi.prior[2] <- 0.29197
## prior ~ dcat(pi.prior)
## }
From here, we perform our analysis as usual. We compile the network model, generate a posterior distribution, and evaluate the results. As seen below, there doesn’t appear to be much of a difference in survival time between the treatments.
Post_simple <-
compileJagsModel(Lung_simple) %>%
HydePosterior(variable.names = c("time", "trt", "celltype", "karno", "diagtime",
"age", "prior"),
n.iter = 1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 0
## Unobserved stochastic nodes: 7
## Total graph size: 213
##
## Initializing model
options(pixie_count = 2)
Post_simple %>%
group_by(trt) %>%
do(broom::tidy(as.data.frame(.))) %>%
ungroup() %>%
filter(column == "time") %>%
dust(caption = "Comparison of Survival Times Using Basic Survival Model") %>%
sprinkle(round = 3) %>%
hyde_medley()
## Warning: attributes are not identical across measure variables; they will
## be dropped
| trt | column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| standard | time | 463 | 127.432 | 56.851 | 106.899 | 122.425 | 55.132 | 46.955 | 326.07 | 279.115 | 0.689 | -0.559 | 2.642 |
| test | time | 537 | 99.993 | 45.106 | 82.668 | 95.705 | 40.639 | 36.318 | 222.853 | 186.534 | 0.718 | -0.631 | 1.946 |
ggplot(Post_simple,
mapping = aes(x = trt,
y = time,
colour = trt)) +
geom_boxplot()
For the purposes of comparison, let’s run the model again, but let’s observer the values for the first patient in the veteran dataset to see how HydeNet compares.
fit <- survreg(Surv(time, status) ~ trt + celltype + karno +
diagtime + age + prior,
data = veteran,
dist = "weibull",
scale = 1)
predict(fit)[1]
## [1] 231.6903
veteran[1, ]
## trt celltype time status karno diagtime age prior
## 1 standard squamous 72 1 60 7 69 no
compileJagsModel(Lung_simple,
data = list(trt = "standard",
celltype = "squamous",
karno = 60,
diagtime = 7,
age = 69,
prior = "yes")) %>%
HydePosterior(variable.names = "time",
n.iter = 1000) %>%
summarise(median = median(time))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6
## Unobserved stochastic nodes: 1
## Total graph size: 212
##
## Initializing model
## median
## 1 220.4604
Now, for the sake of illustration, let’s abandon the fact that this lung study was randomized. Instead, we will look at the data as if it were an observational study. Furthermore, let us assume we believe the following:
karno score is dependent on celltypediagtime) is dependent on karno and prior (prior therapy)With these beliefs, we construct our network as follows:
Lung <- HydeNetwork(~ diagtime | prior * karno +
age +
celltype +
karno | celltype +
trt | celltype * karno * diagtime * age * prior +
time | trt * celltype * karno * diagtime * age * prior,
data = veteran)
plot(Lung)
Before proceeding, it would be prudent to determine if HydeNet’s default models will be suitable to the data. The results are not displayed, but the code is provided.
#* Plot settings
par(mfrow=c(2, 2))
#* Default karno model is acceptable.
fit.karno <- lm(karno ~ celltype, data = veteran)
plot(fit.karno)
#* Default diagtime model is not ideal
fit.diagtime <- lm(diagtime ~ karno + prior, data = veteran)
plot(fit.diagtime)
#* Log diagtime model is better
fit.logdiag <- lm(log(diagtime) ~ karno + prior, data = veteran)
plot(fit.logdiag)
#* Restore plot settings
par(mfrow = c(1, 1))
For convenience, we will create a variable for the log-diagnostic time in our data set and reconstruct the network appropriately.
veteran %<>%
mutate(log_diagtime = log(diagtime))
Lung <- HydeNetwork(~ log_diagtime | prior * karno +
age +
celltype +
karno | celltype +
trt | celltype * karno * log_diagtime * age * prior +
time | trt * celltype * karno * log_diagtime * age * prior,
data = veteran)
plot(Lung)
First, we define the model for time, as was done previously.
Lung %<>%
setNode(time,
nodeType = "dnorm",
nodeFitter = "survreg",
nodeFormula = Surv(time, status) ~ trt + celltype + karno +
log_diagtime + age + prior ,
fitterArgs = list(dist = "weibull", scale = 1),
mu = fromData(),
tau = fromData())
Lastly, since we have assumed the decision to treat is based on physician judgment, we will declare trt a decision node so that we can evaluate if patients would do better on one treatment or another.
Lung %<>%
setDecisionNodes(trt)
plot(Lung)
Our next step is to estimate the survival time for the first patient in the veteran data set.
Post <-
Lung %>%
compileDecisionModel(data = list(celltype = "squamous",
karno = 60,
log_diagtime = 1.94591,
age = 69,
prior = "no")) %>%
HydePosterior(variable.names=c("trt", "time"),
n.iter = 1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 0
## Unobserved stochastic nodes: 7
## Total graph size: 305
##
## Initializing model
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6
## Unobserved stochastic nodes: 1
## Total graph size: 304
##
## Initializing model
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 6
## Unobserved stochastic nodes: 1
## Total graph size: 303
##
## Initializing model
ggplot(Post,
mapping = aes(x = trt,
y = time,
colour = trt)) +
geom_boxplot()
It appears that survival time isn’t much different by treatment. This isn’t a terribly surprising result, considering the results of the survival model itself. The hazard ratio for treatment indicates only a 19.7% reduction in hazard for patients on the test treatment. The celltype appears to have a greater impact on survival than the treatment.
dust(fit,
descriptors = c("term_plain", "level"),
caption = "Summary of Survival Model") %>%
sprinkle(cols = c(3, 7, 8),
fn = quote(exp(value))) %>%
sprinkle(cols = 3:8,
halign = "right") %>%
sprinkle(cols = 1:2,
rows = 1,
replace = c("(Intercept)", "")) %>%
sprinkle_colnames("Term", "Level", "HR", "Std. Error",
"Z", "P-value", "LCL", "UCL") %>%
medley_model() %>%
hyde_medley()
| Term | Level | HR | Std. Error | Z | P-value | LCL | UCL |
|---|---|---|---|---|---|---|---|
| (Intercept) | 24.25 | 0.7 | 4.53 | < 0.001 | 6.1 | 96.4 | |
| trt | test | 0.8 | 0.2 | -1.11 | 0.27 | 0.54 | 1.19 |
| celltype | smallcell | 0.44 | 0.26 | -3.13 | 0.002 | 0.26 | 0.74 |
| celltype | adeno | 0.33 | 0.28 | -4.04 | < 0.001 | 0.19 | 0.56 |
| celltype | large | 0.69 | 0.27 | -1.38 | 0.17 | 0.4 | 1.17 |
| karno | 1.03 | 0.01 | 6 | < 0.001 | 1.02 | 1.04 | |
| diagtime | 1 | 0.01 | -0.03 | 0.97 | 0.98 | 1.02 | |
| age | 1.01 | 0.01 | 0.67 | 0.5 | 0.99 | 1.02 | |
| prior | yes | 0.95 | 0.23 | -0.22 | 0.83 | 0.61 | 1.48 |
Out of curiosity, let’s modify the network again to explore the hypothesis that the new treatment may benefit some cell types better than others. To do this, we’ll declare celltype a decision node.
Lung %<>%
setDecisionNodes(celltype)
plot(Lung)
Post2 <-
Lung %>%
compileDecisionModel() %>%
HydePosterior(variable.names = "time",
n.iter = 1000)
ggplot(Post2,
mapping = aes(x = trt,
y = time,
colour = trt)) +
geom_boxplot() +
facet_grid(~ celltype)
As with other HydeNet network objects, it is permissible to define network nodes using R model objects. Suppose we constructed our models for the network before building the network.
The representation below renders differently than the earlier version of the network. It is, in fact, the same network, but the nodes are placed differently due to the order in which nodes are identified and renderd by GraphViz.
fit <- survreg(Surv(time, status) ~ trt + celltype + karno +
log_diagtime + age + prior,
data = veteran,
dist = "weibull",
scale = 1)
fit.logdiag <- lm(log_diagtime ~ karno + prior, data = veteran)
LungMod <-
HydeNetwork(list(fit, fit.logdiag)) %>%
update( ~ . +
karno | celltype +
trt | age * prior * log_diagtime * celltype * karno) %>%
setDecisionNodes(trt, celltype)
plot(LungMod)
The structure of the network is the same and any subsequent analyses (obtaining the posterior distribution and the summaries that follow) are done using the same methods described above.
print(LungMod)
## A Probabilistic Graphical Network
## Has data attached: No
##
## karno | celltype
## dnorm(mu = Unspecified, tau = Unspecified)
## : karno ~ 1
##
## celltype
## dnorm(mu = Unspecified, tau = Unspecified)
## : celltype ~ 1
##
## trt | karno * celltype * age * prior * log_diagtime
## dnorm(mu = Unspecified, tau = Unspecified)
## : trt ~ 1
##
## age
## dnorm(mu = Unspecified, tau = Unspecified)
## : age ~ 1
##
## prior
## dnorm(mu = Unspecified, tau = Unspecified)
## : prior ~ 1
##
## log_diagtime | karno * prior
## dnorm(mu = -0.00596*karno + 0.96735*prior + 1.81978, tau = 1.3261374752273)
## lm: log_diagtime ~ karno + prior
##
## time | karno * celltype * trt * age * prior * log_diagtime
## dnorm(mu = exp(0.0063*age + -1.11364*(celltype==3) + -0.38005*(celltype==4) + -0.82503*(celltype==2) + 0.03076*karno + 0.0178*log_diagtime + -0.07181*(prior==2) + -0.21729*(trt==2) + 3.1433), tau = 1 / (xmat.time[,] %*% vv.time[,] %*% xmatprime.time[,]))
## survreg: Surv(time, status) ~ trt + celltype + karno + log_diagtime + age + prior